Lab Assignment Five: Wide and Deep Network Architectures

Team: Mike Wisniewski

Dataset Selection

Select a dataset similarly to lab one. That is, the dataset must be table data. In terms of generalization performance, it is helpful to have a large dataset for building a wide and deep network. It is also helpful to have many different categorical features to create the embeddings and cross-product embeddings. It is fine to perform binary classification, multi-class classification, or regression.

The dataset used for this exercise will be the Bank Marketing dataset provided by UCI: https://archive.ics.uci.edu/ml/datasets/bank+marketing

The specific dataset used from the above source will be the bank-additional-full.csv which is a dataset with 41,188 examples and 20 features + 1 output. This set differs from the original bank-full.csv by adding 3 additional features which are used to represent economic factors (which may or may not be useful). The data within is dated from May 2008 to November 2010. This data is from a Portuguese banking institution and was gathered as a result of marketing campaigns.

I will not go into what each feature is specifically until we do feature engineering later. However, to give a general overview of the data within, there are 4 main groups of data:

It is important that the above data was grouped in such a manner because it allows me to create logical feature combinations when I build the wide networks later. Not all of the data above is useful, which I will explain in a later section, but I do believe there is enough useful data to provide a decent analysis using wide and deep networks.

The objective of this data is to assess whether the bank product would be subscribed to or not. In other words, this was a campaign to gather interest in a new bank product put forth by the Portuguese institution (the business case).

Preparation

[1 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

The first step I take is to remove data that we know we won't need and to give justification:

The second step is to analyze the categoricals and identify missing data, which has been encoded for us as "unknown". For binary features, I replace "unknown" with an inferred value of "no": a failure to respond on a binary question generally means a no, so I use this assumption to fill in that type of data. We cannot deduce the same for multi-categorical features because, by nature, they give too many options to infer an "unknown" value, and a failure to answer can itself mean something tangible. We therefore preserve "unknown" as its own category for multi-categorical features.
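The unknown-handling rule above can be sketched in pandas. The column names follow the bank-additional-full.csv schema, but the tiny frame here is mock data for illustration, not the notebook's actual preprocessing code:

```python
import pandas as pd

# Mock rows mirroring the bank dataset's schema.
df = pd.DataFrame({
    "default": ["no", "unknown", "yes"],    # binary feature
    "housing": ["yes", "no", "unknown"],    # binary feature
    "education": ["basic.4y", "unknown", "university.degree"],  # multi-categorical
})

# Binary features: treat a non-response as an implicit "no".
binary_cols = ["default", "housing"]
df[binary_cols] = df[binary_cols].replace("unknown", "no")

# Multi-categorical features (e.g. education) keep "unknown" as its own category.
```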

The third step is to handle outliers for numerical and continuous features. The only feature of real concern is pdays, where 999 marks clients who were not contacted about any campaign prior to this one. I want to convert 999 to 0; however, we already have records with 0 as the value for pdays, and the poutcome associated with pdays = 0 is always a success, whereas the poutcome with pdays = 999 is nonexistent. To handle this, I change all pdays of 0 to 0.5 days and all 999s to 0. I don't want to null out the 999s because I will eventually scale this column (which is not possible with null values), so I am opting for substitution instead of trying to infer these values.
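The substitution above is a simultaneous two-way mapping, which a single pandas `replace` with a dict handles cleanly (dict replacement maps original values only, so the new 0s are not re-mapped to 0.5). Mock values for illustration:

```python
import pandas as pd

# pdays: 0 = contacted same day, 999 = sentinel for "never contacted".
pdays = pd.Series([999, 0, 3, 999, 6])

# Map 0 -> 0.5 and the 999 sentinel -> 0 in one pass.
pdays = pdays.replace({0: 0.5, 999: 0})
```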

Based on the dataset description above, pdays is the only numerical column that needs any outlier handling.

The fourth step is to convert categorical features so that we preserve the original string values of the categories while creating an entirely new set of numeric versions. We preserve the originals because we will later combine the qualitative values of each category when crossing the dataset. I will be utilizing some of the lecture code for this section.
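A minimal sketch of this step, keeping the string columns and adding integer-coded copies. The `_int` suffix is my naming choice for illustration, not necessarily the notebook's:

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["admin.", "technician", "admin."],
    "marital": ["married", "single", "married"],
})

# Keep the originals (needed later for the crossed features) and add
# integer-coded copies; factorize assigns codes in order of appearance.
for col in ["job", "marital"]:
    df[col + "_int"] = pd.factorize(df[col])[0]
```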

The fifth step is to convert the no/yes output variable "y" into a numeric binary column. It is important that we do not do the same for our binary X features, because the cross combinations we build later expect strings, not integer values. I kept the original code from my first pass to showcase that I ran into this issue much later in the notebook when I tried cross combining. Only our "y" output is converted to numeric.

The sixth step is to scale the dataset using a standard scaler. I chose a standard scaler because I am confident that I have handled outliers appropriately for this exercise, so there is no need for a robust scaler.
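Standard scaling is the per-column transform z = (x - mean) / std, the same transform sklearn's `StandardScaler` applies. A numpy sketch with mock data:

```python
import numpy as np

# Two numeric columns on very different scales.
x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Standardize each column to mean 0 and standard deviation 1.
z = (x - x.mean(axis=0)) / x.std(axis=0)
```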

Describe the dataset

After completing the above steps, our dataset is ready for splitting and modeling. The dataset retains the original string values of job, marital, education, and poutcome (to be repurposed later for the wide network) while also carrying integer representations of these columns. All other non-string fields have been scaled appropriately.

Due to the binary nature of the output variable and the distribution of values (36,548 class 0 vs 4,640 class 1), we need to be careful about how values are distributed between train and test sets. I chose a Random Under Sampler to create a dataset containing roughly the same number of class 0 and class 1 instances. During my initial model construction, I noticed the model was always predicting at around 89% to 90% accuracy - a telltale sign that it was simply predicting the largest class (class 0). I could have chosen a Random Over Sampling technique, but that duplicates records and thus invites overfitting. For the under sampling, I chose to keep slightly more class 0 instances than class 1, because in an initial run the precision for class 1 was significantly higher than for class 0 - indicating that there were not enough class 0 instances for the model to make accurate predictions. Finally, class 1 has around 4,600 instances, so the final dataset will contain around 11,000 instances, which is a healthy amount for this exercise.
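The notebook uses imblearn's RandomUnderSampler; the idea can be reproduced in plain pandas. The 1.2 majority ratio below is my illustrative stand-in for the "slightly more class 0 than class 1" choice, not the notebook's exact setting:

```python
import pandas as pd

# Mock imbalanced labels: 1000 majority (class 0), 120 minority (class 1).
df = pd.DataFrame({"y": [0] * 1000 + [1] * 120})

# Keep every minority row; sample 20% more majority rows than minority.
minority = df[df["y"] == 1]
n_majority = int(len(minority) * 1.2)  # illustrative ratio
majority = df[df["y"] == 0].sample(n=n_majority, random_state=42)

# Combine and shuffle the under-sampled dataset.
balanced = pd.concat([majority, minority]).sample(frac=1, random_state=42)
```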

The final dataset contains 11,268 instances with 19 features and 1 output variable.

Cross-Features

[1 points] Identify groups of features in your data that should be combined into cross-product features. Provide justification for why these features should be crossed (or why some features should not be crossed).

The first group I will call the personal group. It contains features that are personal to a client: job, marital status, and education. These belong together because they create a rough estimation of who a person is.

The second group of features for crossing is the financial category: default, housing, and loan. These features are appropriate together because they paint a picture of a client's financial health.

The third group of features is a combination of the above two; all six are included: job, marital status, education, default, housing, and loan. Together they paint a fuller picture of a client - what life status they have (job, marriage, education) along with their current financial picture (default, housing, loan). Because all six of these categories are traditionally used in banking, I find them appropriate for crossing the dataset.
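A cross-product feature is just the concatenation of the string categories, so each unique combination becomes its own category - which is why the original string columns were preserved earlier. A minimal sketch with mock rows (the `_` separator is my choice):

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["admin.", "technician"],
    "marital": ["married", "single"],
    "education": ["university.degree", "basic.4y"],
})

# Cross the personal group: every unique (job, marital, education) triple
# becomes a single categorical value.
df["job_marital_education"] = df["job"] + "_" + df["marital"] + "_" + df["education"]
```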

Performance Metrics

[1 points] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.

I have implemented a function (similar to the one used in the previous lab) which assesses the precision, recall, and F1 scores. Of the three, the most important metric is precision. Our aim is to lower false positives, as this model could feed company KPIs, which in turn feed financial forecasts. If KPIs are overpredicted, company financials could look more optimistic than intended, with grave consequences for shareholder expectations. Therefore, from a financial standpoint, precision is the most important metric, followed closely by F1. I will assess both, with precision being the more important of the two.
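The definitions behind this argument can be written from first principles (the notebook's helper builds on Keras metrics; this plain-Python sketch just shows why precision penalizes false positives):

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of everything predicted positive, how much really was?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything truly positive, how much did we catch?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of the two.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```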

In addition to F1 and Precision, another appropriate metric to monitor is mean binary cross entropy: https://www.tensorflow.org/api_docs/python/tf/keras/metrics

The full list of metrics and losses I will be monitoring:

I have created four helper functions below to plot the metrics we want to analyze, as well as a confusion matrix and a McNemar test for model comparisons. The average-metric function extracts the averages of a specified metric from a model's training history.

Dataset Splitting

[1 points] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Argue why your cross validation method is a realistic mirroring of how an algorithm would be used in practice.

For my dataset split, I have decided on an 80/20 split: there are a fair number of records, but not enough to justify a 90/10 split and not so few that a 70/30 split is necessary. If we stratify our output so train and test have proportional class counts, we risk biasing our model toward expecting that same proportion in real data. As evident from the distribution before under sampling, we expect to see an outcome of 1 roughly 10% of the time - in other words, the campaign succeeded in getting a sign-up for the new product from about 10% of the population.

The approach, because our output is binary, is to keep the split and shuffle randomly without stratification. We will use a cross validation approach with a Logistic Regression classifier to see roughly what our expected precisions should be. Anything around 90% should make us cautious that the model is simply predicting class 0. Anything about 70% or greater suggests the model is set up to generalize without class imbalance bias.

We will use 5 folds. Our dataset is a healthy size at roughly 11,000 instances, but the more folds we use, the less data we have to validate on. I have chosen to include a validation set per model per cross-fold (I will explain my validation threshold later).
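The fold construction can be sketched without sklearn: shuffle the indices once (no stratification, per the reasoning above), cut them into 5 parts, and hold each part out in turn. A minimal numpy sketch:

```python
import numpy as np

def kfold_indices(n, k=5, seed=42):
    """Shuffled, non-stratified k-fold (train_idx, val_idx) pairs."""
    idx = np.random.default_rng(seed).permutation(n)
    parts = np.array_split(idx, k)  # k near-equal chunks
    folds = []
    for i in range(k):
        val = parts[i]
        train = np.concatenate([parts[j] for j in range(k) if j != i])
        folds.append((train, val))
    return folds

folds = kfold_indices(20, k=5)
```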

As evident below, our splits show about 75% precision, so I am confident that our class balancing has kept the model from becoming biased toward predicting a single class. Further, 75% precision means that 3 out of every 4 clients the model flags as positive actually are. For an economic or financial forecasting model, this is really good - if I were right 3 out of 4 times in my business decisions, I'd be very wealthy.

[2 points] Create at least three combined wide and deep networks to classify your data using Keras. Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations. Note: use the "history" return parameter that is part of Keras "fit" function to easily access this data.

The approach is to create our base models first - a deep NN and a wide NN - then combine them into 3 models. Little analysis will be done for the first two models; I will make some comments, but don't treat these as models to be taken seriously. The final 3 models in this section will be the ones worth scrutinizing.

Defining Model Parameters

I will be using the Adam optimizer, as I believe it is a safe and appropriate optimizer for binary classification. Our loss function will be MSE (I teetered with switching to binary crossentropy but ended up sticking with MSE); MSE is another safe loss to gauge performance and is sometimes used for binary classification. Our metrics have been defined above. Our kernel initializer will be the Glorot uniform distribution. I think Glorot normal or uniform is the best choice, and I have no reasoning for uniform over normal other than that I have never used the uniform distribution in practice and thought it would be interesting. Each layer will have a tanh activation. Tanh vs sigmoid is debatable - industry practice tends to favor tanh for hidden layers, and I agree - but I kept sigmoid as the final activation. The validation split will be 10%: with roughly 8,000 records in our train dataset split across 5 folds, each fold's validation set will be around 640 instances, which I view as the bare minimum for a validation set. Additionally, I think a validation set is important to gauge model performance per split.

Creating a deep network

As stated above, I won't go into too deep an analysis, as this is a setup model. However, it is worth noting that our average validation precision is about 79%, which is very good considering we have further enhancements to make.

Creating a Wide Network

As with our deep network above, there is not much to say about this model because it is used for setup. It is not surprising that the wide model has poorer precision and other metrics compared to its deep counterpart: the dataset is fairly narrow and does not provide many sensible categories to combine.

Combining Networks

We now combine the deep and wide networks above into a single network. For this section, we will create 3 different combo networks, each with the same amount of deep layers but altering the number of column combinations used:

  1. Uses only the default-housing-loan column combination, as I think this is a more telling set of features than job-marital-education
  2. Uses both the job-marital-education and default-housing-loan column combinations, but not the combined six-feature cross
  3. Uses all 3 column combinations

We only need to create a new concat branch that joins the last deep layer and the last wide layer (each just before its final output layer).
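The concat step above can be sketched with the Keras functional API. The layer sizes and input widths here are illustrative placeholders, not the notebook's exact dimensions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Deep branch: scaled numeric/embedded features (16 is a placeholder width).
deep_in = layers.Input(shape=(16,), name="deep_input")
d = layers.Dense(32, activation="tanh", kernel_initializer="glorot_uniform")(deep_in)
d = layers.Dense(16, activation="tanh", kernel_initializer="glorot_uniform")(d)

# Wide branch: crossed/one-hot features (64 is a placeholder width).
wide_in = layers.Input(shape=(64,), name="wide_input")
w = layers.Dense(16, activation="tanh", kernel_initializer="glorot_uniform")(wide_in)

# Join the last layer of each branch, then the sigmoid output head.
merged = layers.concatenate([d, w])
out = layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model(inputs=[deep_in, wide_in], outputs=out)
model.compile(optimizer="adam", loss="mse",
              metrics=[tf.keras.metrics.Precision()])
```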

For the model analysis, I will gather my thoughts after the 3rd combo model is run. The next couple of sections contain only code and no analysis; see the end of the Combo Model 3 section for the full analysis.

Combo Model 1

Combo Model 2

Combo Model 3

This is interesting. We would have expected the combination network with all 3 types of combinations, as specified at the beginning of this notebook, to be the superior model. But per the analysis above, combo model 2 is clearly the better model based on our foundational metric, precision. Additionally, our confusion matrix supports this claim: looking at its precision column, we have healthy precisions, with class 0 at 73% and class 1 at 70%. Only our first combination model outperformed these per-class figures, but it had a lower overall average precision.

Another concern is the increase in loss and the decrease in our performance metrics on the validation data. Could this be due to the number of neurons per layer? At first glance this looks like overfitting across the board, but I would argue it is not necessarily overfitting, because each validation metric is within 1-2% of the average of the corresponding training metric. Our best model moving forward will be combo model 2.

Looking at our McNemar tests, none of the 3 confidence intervals contains 0. Therefore, with 95% confidence, we reject the null hypothesis that the models perform the same - all models differ from each other.
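The McNemar statistic behind these tests needs only the two discordant counts - b and c, the examples each model got right while the other got wrong. A stdlib-only sketch with the standard continuity correction (the counts below are mock values; for 1 degree of freedom the chi-square p-value reduces to erfc(sqrt(stat / 2))):

```python
import math

def mcnemar(b, c):
    """McNemar test with continuity correction.

    b, c: discordant counts (model A right / B wrong, and vice versa).
    Returns the chi-square statistic (1 df) and its p-value.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(stat / 2))  # chi-square(1 df) survival function
    return stat, p

# Mock discordant counts: a large imbalance means the models differ.
stat, p = mcnemar(40, 10)
```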

Adding Deep Layers

[2 points] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two different number of layers. Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to select the number of layers that performs superiorly

For this section, I have created 2 more models, each adding a single layer to the best combo model from above, using only its two column combinations. That is, the first model of this section pairs combo model 2's wide configuration with a deep network of 4 layers; the second pairs it with a deep network of 5 layers. I chose combo model 2's wide configuration because it performed the best, with a precision score of 81%. As in the previous section, I will perform the final analysis after the Combo Model with Deep 5.

Combo Model with Deep 4

Combo Model with Deep 5

It appears that the deep 5 model is the best performing model based on precision, with a score of 82%. It is interesting, though, that the deep 4 model did not beat our best combo model while deep 5 did. The training metrics and curves make sense, but looking at validation, I can't help but think overfitting may still be occurring; the magnitude of the per-epoch decrease warrants an analysis of why. It may be because I chose not to stratify the output variable "y": since both class 0 and class 1 were purposefully under sampled at a specific proportion for the entire dataset, I should perhaps have also stratified y, giving the same proportions in both the train and test sets instead of the current random split. I will stand by this analysis, however, even if it is wrong.

Best Model vs Deep Neural Network

[1 points] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network). For classification tasks, compare using the receiver operating characteristic and area under the curve. For regression tasks, use Bland-Altman plots and residual variance calculations. Use proper statistical methods to compare the performance of different models.

For this section, I will be comparing my best model with the first deep model created above in the first section. There will be 3 metrics used to assess performance:

I believe all three have something to offer and are appropriate for identifying the superior performer. The full analysis is held at the end of this section.

Based on the metrics chosen, I can state with confidence that our best neural network is (1) better performing by a relatively wide margin and (2) statistically different from the first deep neural network created at the beginning of this notebook. But the concerning factor still looms: we are decreasing in performance per metric on validation data, indicating a possible overfit.

Exceptional Work: UMAP Dimensionality Reduction

One idea (required for 7000 level students): Capture the embedding weights from the deep network and (if needed) perform dimensionality reduction on the output of these embedding layers (only if needed). That is, pass the observations into the network, save the embedded weights (called embeddings), and then perform dimensionality reduction in order to visualize results. Visualize and explain any clusters in the data.

In this section, the idea is to extract each embedding layer's name and associated weights, then fit a UMAP transformer on those weights. Finally, we assess the magnitude of each set of weights to determine which embedding layer had the most sway and influence on the model. The full analysis is held at the end of this section.

Embedding Names and Weights

Reduction plotting

When plotting the UMAP reduction representations per category, we can see distinct clusters. The clusters do not overlap with each other, and this type of reduction can be used to deduce the categories driving the highest variance within the model. It is safe to say that each of these features is prominent in determining an outcome - but to what magnitude?

Embedded Layer Magnitudes

Interesting. We can see that the most impactful embedding layers are the marital layer and the job-marital-education layer - although, looking at the magnitudes, it can be argued that all 5 categories are very impactful. In summary, these 2 embedding layers have the largest influence when predicting whether a client closes on a new product or not. Thus, sales teams can target these specific categories when new products launch, as we can be confident they give us greater predictive value.
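The magnitude ranking above can be sketched as the norm of each embedding layer's weight matrix. The weights here are mock arrays; the notebook would pull the real ones from the trained model (e.g. via `layer.get_weights()`):

```python
import numpy as np

# Mock embedding weight matrices (rows = categories, cols = embedding dims).
embeddings = {
    "marital": np.array([[0.9, -0.8], [0.7, 0.6]]),
    "job": np.array([[0.1, 0.2], [0.05, -0.1]]),
}

# Score each layer by the Frobenius norm of its weights, then rank.
magnitudes = {name: float(np.linalg.norm(w)) for name, w in embeddings.items()}
ranked = sorted(magnitudes, key=magnitudes.get, reverse=True)
```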

Final Thoughts

Our most influential categorical values were marital status and job-marital-education status. Additionally, we have achieved around an 82% precision (with what appears to be an 80% accuracy) for the best performing model. In the context of a predictive financial model for new products, this is remarkable. Predicting money flows with a high precision and accuracy is extremely valuable and can lead to confident sales growth for a firm.